Get the data

We will be attempting to find a linear regression that models college tuition rates, based on a dataset from US News and World Report. Alas, this data is from 1995, so it is very outdated; still, we will see what we can learn from it.


Question 1:

  1. The dataset is located at http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv; figure out how to use the code you were given last time for read.csv( ) and read.table( ) to read the data into R and call it tuition. Use the functions we learned last time to familiarize yourself with the data in tuition.
 #Your Code Here
tuition = read.csv("http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv")
head(tuition)
##      ID                              Name State Public Avg.SAT Avg.ACT
## 1  1061         Alaska Pacific University    AK      2     972      20
## 2  1063 University of Alaska at Fairbanks    AK      1     961      22
## 3  1065    University of Alaska Southeast    AK      1      NA      NA
## 4 11462 University of Alaska at Anchorage    AK      1     881      20
## 5  1002       Alabama Agri. & Mech. Univ.    AL      1      NA      17
## 6  1003               Faulkner University    AL      2      NA      20
##   Applied Accepted Size Out.Tuition Spending
## 1     193      146  249        7560    10922
## 2    1852     1427 3885        5226    11935
## 3     146      117  492        5226     9584
## 4    2065     1598 6209        5226     8046
## 5    2817     1920 3958        3400     7043
## 6     345      320 1367        5600     3971
summary(tuition)
##        ID                       Name          State         Public     
##  Min.   : 1002   Bethel College   :   4   NY     :101   Min.   :1.000  
##  1st Qu.: 1874   Concordia College:   4   PA     : 83   1st Qu.:1.000  
##  Median : 2650   Trinity College  :   4   CA     : 70   Median :2.000  
##  Mean   : 3126   Columbia College :   3   TX     : 60   Mean   :1.639  
##  3rd Qu.: 3431   Union College    :   3   MA     : 56   3rd Qu.:2.000  
##  Max.   :30431   Augustana College:   2   OH     : 52   Max.   :2.000  
##                  (Other)          :1282   (Other):880                  
##     Avg.SAT          Avg.ACT         Applied           Accepted      
##  Min.   : 600.0   Min.   :11.00   Min.   :   35.0   Min.   :   35.0  
##  1st Qu.: 884.5   1st Qu.:20.25   1st Qu.:  695.8   1st Qu.:  554.5  
##  Median : 957.0   Median :22.00   Median : 1470.0   Median : 1095.0  
##  Mean   : 968.0   Mean   :22.12   Mean   : 2752.1   Mean   : 1870.7  
##  3rd Qu.:1038.0   3rd Qu.:24.00   3rd Qu.: 3314.2   3rd Qu.: 2303.0  
##  Max.   :1410.0   Max.   :31.00   Max.   :48094.0   Max.   :26330.0  
##  NA's   :523      NA's   :588     NA's   :10        NA's   :11       
##       Size        Out.Tuition       Spending    
##  Min.   :   59   Min.   : 1044   Min.   : 1834  
##  1st Qu.:  966   1st Qu.: 6111   1st Qu.: 6116  
##  Median : 1812   Median : 8670   Median : 7729  
##  Mean   : 3693   Mean   : 9277   Mean   : 8988  
##  3rd Qu.: 4540   3rd Qu.:11659   3rd Qu.:10054  
##  Max.   :31643   Max.   :25750   Max.   :62469  
##  NA's   :3       NA's   :20      NA's   :39
  1. Make a new variable in tuition called Acc.Rate that contains the acceptance rate for each university. You may find it handy to use the variables ‘Applied’ and ‘Accepted’.
  # YOUR CODE HERE
tuition$Acc.Rate = tuition$Accepted / tuition$Applied
head(tuition)
##      ID                              Name State Public Avg.SAT Avg.ACT
## 1  1061         Alaska Pacific University    AK      2     972      20
## 2  1063 University of Alaska at Fairbanks    AK      1     961      22
## 3  1065    University of Alaska Southeast    AK      1      NA      NA
## 4 11462 University of Alaska at Anchorage    AK      1     881      20
## 5  1002       Alabama Agri. & Mech. Univ.    AL      1      NA      17
## 6  1003               Faulkner University    AL      2      NA      20
##   Applied Accepted Size Out.Tuition Spending  Acc.Rate
## 1     193      146  249        7560    10922 0.7564767
## 2    1852     1427 3885        5226    11935 0.7705184
## 3     146      117  492        5226     9584 0.8013699
## 4    2065     1598 6209        5226     8046 0.7738499
## 5    2817     1920 3958        3400     7043 0.6815761
## 6     345      320 1367        5600     3971 0.9275362
summary(tuition)
##        ID                       Name          State         Public     
##  Min.   : 1002   Bethel College   :   4   NY     :101   Min.   :1.000  
##  1st Qu.: 1874   Concordia College:   4   PA     : 83   1st Qu.:1.000  
##  Median : 2650   Trinity College  :   4   CA     : 70   Median :2.000  
##  Mean   : 3126   Columbia College :   3   TX     : 60   Mean   :1.639  
##  3rd Qu.: 3431   Union College    :   3   MA     : 56   3rd Qu.:2.000  
##  Max.   :30431   Augustana College:   2   OH     : 52   Max.   :2.000  
##                  (Other)          :1282   (Other):880                  
##     Avg.SAT          Avg.ACT         Applied           Accepted      
##  Min.   : 600.0   Min.   :11.00   Min.   :   35.0   Min.   :   35.0  
##  1st Qu.: 884.5   1st Qu.:20.25   1st Qu.:  695.8   1st Qu.:  554.5  
##  Median : 957.0   Median :22.00   Median : 1470.0   Median : 1095.0  
##  Mean   : 968.0   Mean   :22.12   Mean   : 2752.1   Mean   : 1870.7  
##  3rd Qu.:1038.0   3rd Qu.:24.00   3rd Qu.: 3314.2   3rd Qu.: 2303.0  
##  Max.   :1410.0   Max.   :31.00   Max.   :48094.0   Max.   :26330.0  
##  NA's   :523      NA's   :588     NA's   :10        NA's   :11       
##       Size        Out.Tuition       Spending        Acc.Rate      
##  Min.   :   59   Min.   : 1044   Min.   : 1834   Min.   :0.09139  
##  1st Qu.:  966   1st Qu.: 6111   1st Qu.: 6116   1st Qu.:0.68122  
##  Median : 1812   Median : 8670   Median : 7729   Median :0.78261  
##  Mean   : 3693   Mean   : 9277   Mean   : 8988   Mean   :0.75479  
##  3rd Qu.: 4540   3rd Qu.:11659   3rd Qu.:10054   3rd Qu.:0.86087  
##  Max.   :31643   Max.   :25750   Max.   :62469   Max.   :1.00000  
##  NA's   :3       NA's   :20      NA's   :39      NA's   :13
  1. Find and print the row of the data that corresponds to UNC (“University of North Carolina at Chapel Hill”).
  # YOUR CODE HERE
tuition[tuition$Name == "University of North Carolina at Chapel Hill",]
##       ID                                        Name State Public Avg.SAT
## 682 2974 University of North Carolina at Chapel Hill    NC      1    1121
##     Avg.ACT Applied Accepted  Size Out.Tuition Spending  Acc.Rate
## 682      NA   14596     5985 14609        8400    15893 0.4100438

Writing functions

We have seen many examples of using functions in R, like summary( ) or t.test( ). Now you will learn how to write your own functions. Defining a function means writing code that looks something like this:

my_function <- function(VAR_1, VAR_2){
  
  # do some stuff with VAR_1 and VAR_2
  return(result)
  
}

Then you run the code in R to “teach” it how your function works, and after that, you can use it like you would any other pre-existing function. For example, try out the following:

add1 <- function(a, b){
  
  # add the variables
  c = a + b
  return(c)
  
}

add2 <- function(a, b = 3){
  
  # add the variables
  c = a + b
  return(c)
  
}

# Try adding 5 and 7
add1(5, 7)
## [1] 12
add2(5, 7)
## [1] 12
# Try adding one variable
#add1(5)
add2(5)
## [1] 8

Question 2:

What was the effect of b = 3 in the definition of add2( )?

The default value for b is 3

Question 3:

  1. Recall that the equations for simple linear regression are: \[\beta_1 = r \frac{S_Y}{S_X} \hspace{0.5cm} \beta_0 = \bar{Y} - \beta_1 \bar{X}\]

Write your own functions, called beta1( ) and beta0( ) that take as input some combination of Sx, Sy, r, y_bar, and x_bar, and use that to calculate \(\beta_1\) and \(\beta_0\).

  # YOUR CODE HERE

beta1 = function(r, sY, sX) {
  if (identical(sX, 0)) {
    print("sX cannot be 0")
    return(NA)
  }
  return(r * sY / sX)
}

beta0 = function(r, sY, sX, y_bar, x_bar) {
  return(ybar - beta1(r, sY, sX) * x_bar)
}

beta1(1, 1, 0)
## [1] "sX cannot be 0"
## [1] NA
  1. Try your function with Sx = 0. Did it work? If not, fix your function code. Explain why it would be a problem to do linear regression with \(S_X = 0\).
Cannot divide by zero

Linear Regression by hand

Use the code below to make a scatterplot of college tuition versus average SAT score.

plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "title", xlab = "label", ylab = "label", pch = 7, cex = 2, col = "blue")


Question 4:

  1. Make your own scatterplot, but change the input of plot( ) so that it looks nice.
  # YOUR CODE HERE
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")

  1. What do pch and cex do?
pch specifies the symbols used
cex indicates the amount by which plotting text and symbols are scaled relative to the default size
  1. We have used the function abline( ) to add a vertical line or a horizontal line to a graph. However, it can also add lines by slope and intercept. Read the documentation of abline( ) until you understand how to do this. Then add a line with slope 10 and intercept 0 to your plot.
  # YOUR CODE HERE
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")
abline(a=0, b=10)

  1. Does this line seem to fit the data well?
Not particularly well

Question 5:

  1. Use the functions you already know in R and the ones you created, beta1( ) and beta0( ), to find the slope and intercept for a regression line of Avg.SAT on Out.Tuition. Remake your scatterplot, and add the regression line.

(Hint: You may have some trouble finding the mean and sd because there is some missing data. Look at the documentation for the functions you use. What could we add to the function arguments to ignore values of NA?)

  # YOUR CODE HERE
tuition = tuition[!is.na(tuition$Out.Tuition) & !is.na(tuition$Avg.SAT),]
ybar = mean(tuition$Out.Tuition)
xbar = mean(tuition$Avg.SAT)
sY = sd(tuition$Out.Tuition)
sX = sd(tuition$Avg.SAT)
r = cor(tuition$Avg.SAT, tuition$Out.Tuition)
b0 = beta0(r, sY, sX, ybar, xbar)
b1 = beta1(r, sY, sX)
paste(ybar, xbar, sY, sX, r, b1, b0)
## [1] "9974.48109517601 967.196870925684 4001.20564002006 122.616914275463 0.606700481134289 19.7977041035616 -9173.79636530133"
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")
abline(a=b0, b=b1)

  1. What do you conclude about the relationship between average SAT score and a college’s tuition?
I can conclude that there is a correlation between average SAT score and out of state tuition

Question 6:

  1. Write a new function called predict_yval(X, Y, x_new) that takes as input a vector of explanatory variables (X), a vector of y-variables (Y), and a new x-value that we want to predict (x_new). The output of the function should be the predicted y-value for x_new from a regression line. (Hint: You can use functions inside functions.)
predict_yval <- function(X, Y, x_new) {
  X = na.omit(X)
  Y = na.omit(Y)
  sY = sd(Y)
  sX = sd(X)
  r = cor(X, Y)
  return(beta1(r, sY, sX)*x_new + beta0(r, sY, sX, mean(Y), mean(X)))
}
  1. Now find the average SAT score and tuition of UNC and of Duke, and compare their predicted values to the truth:
  # YOUR CODE HERE
duke_sat_avg = tuition[tuition$Name == "Duke University",]$Avg.SAT
unc_sat_avg = tuition[tuition$Name == "University of North Carolina at Chapel Hill",]$Avg.SAT
duke_tuition_avg = tuition[tuition$Name == "Duke University",]$Out.Tuition
unc_tuition_avg = tuition[tuition$Name == "University of North Carolina at Chapel Hill",]$Out.Tuition
duke_tuition_prediction = predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, duke_sat_avg)
unc_tuition_prediction = predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, unc_sat_avg)
duke_sat_prediction = predict_yval(tuition$Out.Tuition, tuition$Avg.SAT, duke_tuition_avg)
unc_sat_prediction = predict_yval(tuition$Out.Tuition, tuition$Avg.SAT, unc_tuition_avg)
cat(duke_sat_avg, unc_sat_avg, duke_tuition_avg, unc_tuition_avg)
## 1302 1121 18590 8400
cat(duke_sat_prediction, unc_sat_prediction, duke_tuition_prediction, unc_tuition_prediction)
## 10134.66 9945.208 16602.81 13019.43
  1. Would you say you are getting a deal at UNC? How about at Duke?
Not getting a good deal at Duke, getting a great deal at UNC

lm() and diagnostics

You now have functions to calculate the slope and intercept of a linear regression, and to predict values. As you might expect, R was already able to do this, using the function lm( ). In class, you saw how to read the output of lm( ). Run the following regression of Avg.SAT on Out.Tuition, and refamiliarize yourself with the output.

  # Make linear model
  my_lm = lm(Out.Tuition ~ Avg.SAT, data = tuition)
  summary(my_lm)
## 
## Call:
## lm(formula = Out.Tuition ~ Avg.SAT, data = tuition)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9809.9 -2234.6   204.6  2309.3 12336.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9173.7964   914.3482  -10.03   <2e-16 ***
## Avg.SAT        19.7977     0.9379   21.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3183 on 765 degrees of freedom
## Multiple R-squared:  0.3681, Adjusted R-squared:  0.3673 
## F-statistic: 445.6 on 1 and 765 DF,  p-value: < 2.2e-16
  names(my_lm)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
  plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition vs Average SAT Scores", ylab = "Tuition", xlab = "Average SAT Score", pch = 1, cex = 1, col = "blue")
  abline(my_lm)

  plot(my_lm)

Check out names(my_lm). This will give you a list of information we can access using $. For example, compare my_lm$coefficents to your beta1 and beta0 outputs.

The output of lm( ) is made to play nicely with other functions in R. For example, try adding abline(my_lm) to your scatterplot. We can also use lm( ) to check some common diagnostics, to see if the linear model is a good fit for the data. Try plot(my_lm), and familiarize yourself with the first three plots that are automatically generated. (The fourth is not covered in this course, so you do not need to worry about it for now.)


Question 7:

  1. The variable Spending contains the expenditure of the school per student. Suppose we want to make a regression that predicts tuition cost from the expenditure per student. Make a linear model for Spending versus Out.Tuition. Comment on the summary of the model, and plot it on an appropriate scatter plot. Does the model seem to be a good fit for this data?
my_lm = lm(Out.Tuition ~ Spending, data = tuition)
plot(tuition$Spending, tuition$Out.Tuition, main = "Tuition vs Spending", ylab = "Tuition", xlab = "Spending", pch = 1, cex = 1, col = "blue")
abline(my_lm)

The model is not a very good fit. There are extreme outliers that are skewing the result.
  1. Plot the residuals versus the values of Spending. Do you notice any issues? Hint: Use your own function predict_yval() or the built-in function predict(my_lm). You will need to think about the problem of missing data (NAs).
plot(na.omit(tuition$Spending), na.omit(my_lm$residuals), main = "Residuals vs Spending", ylab = "Residuals", xlab = "Spending", col = "blue")

  1. Use plot() to look at the automatic diagnostics. What is each plot showing? What seems to be going wrong here? Which schools are marked as outliers?
plot(my_lm)

The high-spending outlier schools are heavily skewing the model
  1. Roughly speaking, an outlier is “influential” if it changes the regression line when you remove it. Decide for yourself which data points are influential outliers. Recalculate the linear model without any outliers, and plot it on a scatterplot.
tuition_spending = tuition[tuition$Spending < 20000,]
my_lm = lm(Out.Tuition ~ Spending, data = tuition_spending)
plot(tuition_spending$Spending, tuition_spending$Out.Tuition, main = "Tuition vs Spending", ylab = "Tuition", xlab = "Spending", pch = 1, cex = 1, col = "blue")
abline(my_lm)


Question 8:

  1. Now suppose we want to make a regression that predicts tuition cost from the size of the student body. Make a linear model for Size versus Out.Tuition. Comment on the summary of the model, and plot it on an appropriate scatter plot, and use plot( ) to look at the diagnostics. Does the model seem to be a good fit for this data?
my_lm = lm(Out.Tuition ~ Size, data=tuition)
plot(tuition_spending$Size, tuition_spending$Out.Tuition, main = "Tuition vs Size", ylab = "Tuition", xlab = "Size", col = "blue")
abline(my_lm)

plot(my_lm)

length(tuition$Spending)
## [1] 767
There does not appear to be a significant correlation between size and tuition
  1. Remake your scatterplot, this time including the option col = tuition$Public. What did this change? Can you use this information to explain why the regression line in (a) did not fit well?
  # YOUR CODE HERE
  1. Make separate linear regressions of Size versus Out.Tuition for private schools and for public schools. Plot both of these, appropriately colored, on your scatterplot. Comment on the output and diagnostics.
  # YOUR CODE HERE

Multiple Linear Regression

We have seen that a college’s tuition relates to its size, its spending per student, and its average SAT score. We have also seen that this relationship may change based on whether the school is public or private. Ideally, instead of making separate regressions for each relationship, we could combine them all into a multiple regression. Fortunately, R makes this easy with lm().


Question 9:

  1. Run the following code to perform a multiple regression. Interpret the results.
  mult.1 <- lm(Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + Acc.Rate, data = tuition)
  
  summary(mult.1)
  1. We can also mix and match continuous variables with categorical ones. Let’s add Public to the regression. The following two models are slightly different, but give essentially identical output. What is the difference between them, and why is it important even though the output still the same?
  mult.2 <- lm(Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + Public, data = tuition)
  mult.3 <- lm(Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + factor(Public), data = tuition)
  1. It is still important to check diagnostics in a multiple regression, although it can be harder to track down the source of problems. Use plot( ) to look at diagnostics for mult.3, and comment.
  # YOUR CODE HERE

Question 10:

  1. A big problem in multiple regression is collinearity, which means that two or more explanatory variables are correlated with each other. Read the documentation for pairs( ), and then use it on the variables involved in mult.3. Hint: You can use the option col = tuition$Public in pairs( )
  # YOUR CODE HERE
  1. Do any of the variables seem strongly related? What is their correlation?
  # YOUR CODE HERE
  1. Explain in your own words why the correlation between the variables you discussed in (a) could be a problem.

Sneak Preview: Interaction Terms

We saw in 12c that whether a school is public or private can affect not only the tuition, but also how the tuition relates to other variables. In a multiple regression, this effect can be captured through interaction terms, which are expressed by var1:var2, and measure how much one variable changes the effect of the other.

Read the following paragraph from the documentation ?formula for some shortcuts for including interactions:

In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.

Question 11:

Create your own multiple regression that predicts tuition from whichever variables you choose, as well as some interaction terms between Public and other variables. Don’t worry about using any official methods to pick variables; simply try a few things and choose the model that seems best. Interpret the results; in particular, think very carefully about what the coefficient for an interaction term with Public might mean.

  # YOUR CODE HERE